1 Introduction

Team A consists of the following members: Amal Alqahtani, Jiaxiang Peng, Naureen Elahi, and Xinya Mu. Our work is available on GitHub.

The novel coronavirus is a contagious respiratory virus that was first identified in Wuhan in December 2019. On February 11, 2020, the disease was officially named COVID-19 by the World Health Organization (WHO). To date, the CDC reports a total of 4,542,579 cases and 152,870 deaths in the United States (Cases in U.S, 2020). Many risk factors have been hypothesized to affect the case and death rates from the virus.

We felt that the following questions were relevant: Which regions have the highest number of deaths? What can we say about patient demographics? Is race a significant risk factor for increased COVID-19 incidence in the United States? Are there any general trends among underlying health conditions? These questions are all suited to Exploratory Data Analysis (EDA), and with them in mind, we looked for readily available COVID-19 data that we could analyze. Eventually, our question morphed into the following: What are the factors (i.e., patient demographics, social determinants of health, environmental variables, underlying health conditions, country of origin) affecting the number of COVID-19 cases and the death rate?

1.1 What do you know about this dataset?

For this project, we used a public dataset called Covid-19, available on GitHub: https://github.com/johndurbin93/Covid-19-Dataset

It includes COVID-19 confirmed case and death numbers through April 14, 2020, which were obtained for each U.S. county from the Center for Systems Science and Engineering (CSSE) Coronavirus Resource Center at Johns Hopkins University. Race demographics for counties were obtained from the County Health Rankings and Roadmaps Program database. Daily temperature data for counties were obtained from the National Oceanic and Atmospheric Administration. County temperature was calculated as the mean temperature over a period starting 10 days before the first confirmed county case and running through the most current date (April 14, 2020). Unacast social distancing data was obtained through a research agreement with the company (Li et al., 2020).

The data looks like the following:

tibble [3,144 x 82] (S3: tbl_df/tbl/data.frame)
 $ Province                                                                                               : chr [1:3144] "New York City" "Nassau" "Suffolk" "Westchester" ...
 $ State                                                                                                  : chr [1:3144] "New York" "New York" "New York" "New York" ...
 $ Latitude                                                                                               : num [1:3144] 40.8 40.7 40.9 41.2 41.8 ...
 $ Longitude                                                                                              : num [1:3144] -74 -73.6 -72.8 -73.8 -87.8 ...
 $ Tests                                                                                                  : num [1:3144] 499143 499143 499143 499143 110616 ...
 $ Days Since 1st Case                                                                                    : num [1:3144] 44 41 38 43 82 35 41 80 39 38 ...
 $ total_cases                                                                                            : num [1:3144] 110465 25250 22691 20191 16323 ...
 $ deaths                                                                                                 : num [1:3144] 7905 1001 608 596 577 ...
 $ Population (for demographic %'s)                                                                       : chr [1:3144] "8623000" "1358343" "1481093" "967612" ...
 $ % less than 18 years of age                                                                            : chr [1:3144] "20.9" "21.459675499999999" "21.134324500000002" "21.900513799999999" ...
 $ % 65 and over                                                                                          : chr [1:3144] "14.1" "17.763039200000001" "16.862951899999999" "17.053116299999999" ...
 $ % Black                                                                                                : chr [1:3144] "24.3" "11.6331442" "7.3924459799999998" "13.8042935" ...
 $ % American Indian & Alaska Native                                                                      : chr [1:3144] "0.4" "0.54294092000000005" "0.61373593999999998" "0.95647842000000005" ...
 $ % Asian                                                                                                : chr [1:3144] "13.9" "10.4504532" "4.1896086199999996" "6.43553408" ...
 $ % Native Hawaiian/Other Pacific Islander                                                               : chr [1:3144] "0.1" "0.1" "9.5899999999999999E-2" "0.13228443000000001" ...
 $ % Hispanic                                                                                             : chr [1:3144] "29.1" "17.231362000000001" "19.775260599999999" "25.140345499999999" ...
 $ % Non-Hispanic White                                                                                   : chr [1:3144] "32.1" "59.333835399999998" "67.190378999999993" "53.088118000000001" ...
 $ % Not Proficient in English                                                                            : chr [1:3144] "9" "5.3660427200000003" "4.00639637" "6.3180527499999997" ...
 $ % Female                                                                                               : chr [1:3144] "52.3" "51.306334300000003" "50.771693599999999" "51.559095999999997" ...
 $ % Rural                                                                                                : chr [1:3144] "0" "0.19223132000000001" "2.6011316799999999" "3.2734774500000001" ...
 $ Population Density per Square mile of Land (2010)                                                      : num [1:3144] 69468 4705 1637 2205 5495 ...
 $ Housing Density Per Square Mile of Land                                                                : num [1:3144] 37106 1645 625 861 2306 ...
 $ Avg Daily March 2011 Sunlight (KJ/m2) Missing HI and AK                                                : num [1:3144] 16233 16649 16539 15031 14299 ...
 $ GDP 2018                                                                                               : num [1:3144] 600244287 81196003 81211899 73404644 362063569 ...
 $ GDP/capita                                                                                             : num [1:3144] 69.6 59.8 54.8 75.9 69.9 ...
 $ Percentage Living in Poverty, All Ages, 2016                                                           : num [1:3144] 17.2 6.1 7.6 10 15 22.9 6.9 16.3 14.4 15.6 ...
 $ Air Quality, Annual Average Ambient Concentrations of PM2.5, 2014                                      : chr [1:3144] "10.8" "10" "9" "10.4" ...
 $ Primary Care Physicians Ratio                                                                          : chr [1:3144] "31.417361111111109" "29.834027777777777" "56.709027777777777" "30.334027777777777" ...
 $ Dentist Ratio                                                                                          : chr [1:3144] "23.334027777777777" "34.500694444444441" "50.209027777777777" "37.834027777777777" ...
 $ Mental Health Provider Ratio                                                                           : chr [1:3144] "4.834027777777778" "13.792361111111111" "15.625694444444443" "10.750694444444443" ...
 $ High School Graduation Rate                                                                            : chr [1:3144] "74.536495200000005" "90.769602500000005" "89.560601599999998" "89.554779199999999" ...
 $ % Some College                                                                                         : chr [1:3144] "84.074597800000006" "75.579882699999999" "67.067068199999994" "71.893479799999994" ...
 $ % Unemployed                                                                                           : chr [1:3144] "3.6720665100000001" "3.5355112100000001" "3.8509406199999998" "3.8880261699999998" ...
 $ % Children in Poverty                                                                                  : chr [1:3144] "19.7" "7.6" "9.4" "10.3" ...
 $ Income Inequality Ratio (80th%/20th%)                                                                  : chr [1:3144] "9.2065919600000008" "4.5137498100000002" "4.3752126000000002" "6.18534249" ...
 $ % Single-Parent Households                                                                             : chr [1:3144] "39.575203500000001" "19.238140600000001" "23.6102569" "25.424638399999999" ...
 $ Social Association Rate                                                                                : chr [1:3144] "12.8789886" "7.9882352399999998" "6.7450214400000004" "8.3754656999999995" ...
 $ Violent Crime Rate                                                                                     : chr [1:3144] "586.40744800000004" "143.663387" "124.039181" "220.606166" ...
 $ Air pollution: Average Daily PM2.5                                                                     : chr [1:3144] "10.8" "10" "9" "10.4" ...
 $ Presence of Drinking Water Violation                                                                   : chr [1:3144] "No" "No" "No" "Yes" ...
 $ % Severe Housing Problems                                                                              : chr [1:3144] "24.378637699999999" "21.324080599999998" "22.888761800000001" "24.236306200000001" ...
 $ Housing: Severe Cost Burden                                                                            : chr [1:3144] "19.610767299999999" "19.1674103" "20.427237699999999" "20.895964899999999" ...
 $ Housing: Overcrowding                                                                                  : chr [1:3144] "5.4547143900000004" "2.5236808000000002" "2.6421104199999998" "4.2602996299999996" ...
 $ Housing: Inadequate Facilities                                                                         : chr [1:3144] "1.2204915199999999" "0.72802853000000001" "0.78609931" "0.73443351999999995" ...
 $ % Drive Alone to Work                                                                                  : chr [1:3144] "6.0475223400000004" "68.609857500000004" "79.604339800000005" "57.587820100000002" ...
 $ % Long Commute - Drives Alone                                                                          : chr [1:3144] "66.7" "45.7" "41.9" "41.2" ...
 $ Sleep <7 Hours_Percent                                                                                 : chr [1:3144] "NA" "38.049835600000002" "35.608102700000003" "33.101763800000001" ...
 $ Sleep <7 Hours_CI_Low                                                                                  : chr [1:3144] "NA" "37.488497199999998" "34.960704200000002" "32.608553999999998" ...
 $ Sleep <7 Hours_CI_High                                                                                 : chr [1:3144] "NA" "38.576512200000003" "36.198949800000001" "33.594731299999999" ...
 $ Diabetes Total Percentage                                                                              : num [1:3144] 6.5 7.2 6.8 6.4 9 10.3 6.8 8.1 6.9 8.2 ...
 $ Diabetes Male Percentage                                                                               : num [1:3144] 6.7 8.5 7.7 6.7 9.7 10.7 7.1 8.6 7 8.7 ...
 $ Diabetes Female Percentage                                                                             : num [1:3144] 6.3 6.2 6 6.2 8.4 10.1 6.6 7.7 6.8 7.9 ...
 $ Coronary Heart Disease Death Rate per 100,000, All Ages, All Races/Ethnicities, Both Genders, 2014-2016: num [1:3144] 100.4 142.4 120.1 97.6 95.2 ...
 $ Hypertension Death Rate per 100,000 (any mention), 35+, All Races/Ethnicities, Both Genders, 2014-2016 : num [1:3144] 232 153 181 124 191 ...
 $ Obesity, Age-Adjusted Percentage, 20+. 2015                                                            : num [1:3144] 15.9 22.5 23.6 20.2 27.2 34.1 22.4 21.2 23.4 23.9 ...
 $ % Fair or Poor Health                                                                                  : chr [1:3144] "15.610279800000001" "12.0544118" "13.0711332" "14.8011888" ...
 $ Average Number of Physically Unhealthy Days                                                            : chr [1:3144] "3.5938226700000002" "2.8691053700000002" "3.1473144999999998" "3.1513169799999998" ...
 $ Average Number of Mentally Unhealthy Days                                                              : chr [1:3144] "3.97126146" "3.4601849699999998" "3.9316660200000002" "3.9107989299999999" ...
 $ % Low Birthweight                                                                                      : chr [1:3144] "8.2870096600000007" "7.8873580399999996" "7.7408509700000003" "7.95359718" ...
 $ % Smokers (adults)                                                                                     : chr [1:3144] "12.418234200000001" "11.225364600000001" "12.625481499999999" "11.371546" ...
 $ % Adults with Obesity                                                                                  : chr [1:3144] "14.6" "23.6" "24.6" "20.7" ...
 $ Food Environment Index                                                                                 : chr [1:3144] "8.3000000000000007" "9.6999999999999993" "9.3000000000000007" "9.1" ...
 $ % Physically Inactive                                                                                  : chr [1:3144] "17.5" "22.8" "24.2" "21.2" ...
 $ % With Access to Exercise Opportunities                                                                : chr [1:3144] "100" "98.858183299999993" "93.3366592" "99.621119899999997" ...
 $ % Excessive Drinking                                                                                   : chr [1:3144] "24.812851999999999" "18.439903699999999" "18.671426799999999" "18.011370899999999" ...
 $ % Uninsured                                                                                            : chr [1:3144] "6.15572813" "5.32768102" "5.4469207300000004" "6.9293390300000004" ...
 $ Preventable Hospitalization Rate (Preventable hospital stays)                                          : chr [1:3144] "3082" "3588" "4339" "3870" ...
 $ % With Annual Mammogram                                                                                : chr [1:3144] "39" "45" "42" "46" ...
 $ % Flu Vaccinated                                                                                       : chr [1:3144] "46" "52" "51" "51" ...
 $ Chronic Respiratory Disease: mortality rate per 100K (2014)                                            : chr [1:3144] "23.47" "29.03" "38.590000000000003" "31.82" ...
 $ Liver Disease: crude mortality rate per 100K (1999-2018)                                               : chr [1:3144] "7.3202151400000002" "7.8321364999999998" "9.3999156199999998" "8.6888457900000002" ...
 $ Liver Disease: % of Total Deaths (1999-2018)                                                           : chr [1:3144] "2.72836E-3" "2.4593800000000002E-3" "3.2464299999999998E-3" "1.92728E-3" ...
 $ Liver Disease: crude mortality rate per 100K (2018)                                                    : chr [1:3144] "6.6924499900000001" "8.7606738499999999" "11.478009800000001" "8.9912072199999997" ...
 $ Liver Disease: % of Total Deaths (2018)                                                                : chr [1:3144] "1.9492800000000001E-3" "2.1281199999999998E-3" "3.04017E-3" "1.5558499999999999E-3" ...
 $ Avg Temp Peak Growth-10 Rate                                                                           : num [1:3144] 8.41 7.41 6.86 5.88 2.25 ...
 $ Avg Temp 10 Before First-Current                                                                       : num [1:3144] 8.33 7.78 7.1 6.68 2.55 ...
 $ Avg Temp First-Current                                                                                 : num [1:3144] 9.23 8.36 7.82 7.53 3.34 ...
 $ First Case                                                                                             : POSIXct[1:3144], format: "2020-03-02" "2020-03-05" ...
 $ Stay At Home                                                                                           : POSIXct[1:3144], format: "2020-03-22" "2020-03-22" ...
 $ No Cases                                                                                               : num [1:3144] 0 0 0 0 0 0 0 0 0 0 ...
 $ No Stay At Home Order                                                                                  : num [1:3144] 0 0 0 0 0 0 0 0 0 0 ...
 $ Stay At Home Order After First Case                                                                    : num [1:3144] 1 1 1 1 1 1 1 1 1 1 ...

The Covid19 dataset has 82 columns and 3,144 rows, for a total of 257,808 individual data points. From the 82 columns, we selected the following variables for EDA:

  1. Province

  2. State

  3. State Code

  4. Tests

  5. Total cases

  6. Deaths

  7. Population (for demographic %’s)

  8. % less than 18 years of age

  9. % 65 and over

  10. % Black

  11. % American Indian & Alaska Native

  12. % Asian

  13. % Native Hawaiian/Other Pacific Islander

  14. % Hispanic

  15. % Non-Hispanic White

  16. % Not Proficient in English

  17. % Female

  18. No Cases

  19. No Stay At Home Order

  20. Stay At Home Order After First Case

  21. Percentage Living in Poverty

  22. Social Association Rate

  23. % Sleep Hour < 7

  24. % Sleep Hour < 7 Confidence Interval low

  25. % Sleep Hour < 7 Confidence Interval high

  26. % Diabetes Total Percentage

  27. % Diabetes Total Male Percentage

  28. % Diabetes Total Female Percentage

  29. Coronary Heart Death Rate per 100,000 people

  30. Hypertension Death Rate per 100,000 people

  31. % Obesity Age adjusted

  32. % Fair or Poor Health

  33. Average number of Physically Unhealthy Days

  34. Average number of Mentally Unhealthy Days

  35. % Low Birthweight

  36. % Smokers

  37. % Adults with Obesity

  38. Food Environment Index

  39. % Physically Inactive

  40. % With Access to Exercise Opportunities

  41. % Excessive Drinking

  42. % Uninsured

  43. Preventable Hospitalization Rate

  44. % With Annual Mammogram

  45. % Flu Vaccinated

  46. Chronic Respiratory Disease per 100,000 people

  47. Liver Disease: crude mortality per 100,000 people

  48. Liver Disease: % of Total death

To prepare our data for EDA, we cleaned the dataset and removed all NAs.
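The cleaning step can be sketched as follows. This is a minimal base-R illustration, not the code we actually ran: the toy data frame, the shortened column names (`pct_black`, `sleep_lt7`), and the choice of which columns to convert are assumptions made for the example. It relies on the fact, visible in the structure listing above, that many numeric variables are stored as character strings and that missing values appear as the literal string "NA".

```r
# Toy data frame mimicking the structure shown above: numeric values stored
# as character strings, with the literal string "NA" marking missing entries.
# Column names are shortened, hypothetical stand-ins for the real ones.
covid <- data.frame(
  Province    = c("New York City", "Nassau", "Suffolk"),
  total_cases = c(110465, 25250, 22691),
  pct_black   = c("24.3", "11.6331442", "7.3924459799999998"),
  sleep_lt7   = c("NA", "38.049835600000002", "35.608102700000003"),
  stringsAsFactors = FALSE
)

# Assumed list of character columns that actually hold numbers.
num_cols <- c("pct_black", "sleep_lt7")

# Turn the "NA" strings into real missing values, then convert to numeric.
covid[num_cols] <- lapply(covid[num_cols], function(x) {
  x[x == "NA"] <- NA
  as.numeric(x)
})

# Keep only the rows that have no missing value in any column.
covid_clean <- na.omit(covid)
print(covid_clean)
```

Dropping every row with any missing value is simple but can discard many counties; restricting `na.omit()` to the columns used in a given analysis is a possible alternative.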

1.2 How was it gathered?

This data set was collected using publicly available data sources (Li et al., 2020). The following table shows the sources for each factor (Li et al., 2020).

Information of data

1.3 What are the limitations of the dataset?

There are several limitations to this data set. It does not represent the entire United States, since it only includes counties with sufficient data; many counties were excluded from the analysis for lack of COVID-19 cases or deaths. Another limitation is that the COVID-19 pandemic is still ongoing and the data were last collected on April 14, 2020. The counties in this data set were therefore at different stages of the outbreak, so the figures may not reflect their ultimate case and death tolls.

1.5 How did the research you gathered contribute to your question development?

Our group was interested in looking further into data related to COVID-19 because it is a timely and relevant topic as we live in this new normal. Before researching which data sets were available, we brainstormed possible COVID-19 topics we could delve into. A few of the topics discussed were hospital utilization during COVID-19, policy, and testing. We each found data sets related to the topics we had discussed and chose the one that best fit the requirements for our project.

After reviewing our chosen data set, and knowing from background reading that there are indeed factors that could impact the death rate and total cases, we chose our ultimate SMART question: “What are factors that lead to an increase in COVID-19 cases and death rates?” After learning more about the virus and potential factors, we understood that risk factors such as health conditions, race, age, and adherence to policies could have an impact on the total cases of COVID-19. Our question also originally derived from pure curiosity, since we already had some knowledge of the virus. We wanted to find the answers to our SMART question ourselves from our own data analysis. With the information that we did know, we came up with sub-questions, discussed further below, to help settle our final SMART question and to make sure it could be answered with the variables in our data set.

1.6 What additional information would have been beneficial?

Additional information in our data set could have aided our analysis. For example, it would have been helpful to have testing information for each city rather than by state, as it is in the data set; this variable was not consistent with the rest of the data collection. It also would have been beneficial to have ongoing data rather than data ending in April. Since scientists are finding new information on COVID-19 as time goes on, our results could have been more relevant with more recent data, which might have led to different conclusions. Finally, a guide describing exactly what each variable in the data set represents would have prevented guesswork.

2 EDA Part

2.1 How did your question change, if at all, after EDA?

In the beginning we asked a few questions, such as “Which race is the majority of the sample?” and “Are patients from a certain race?” During EDA, we dropped the latter question because this is an overall study of the COVID-19 epidemic in different regions of the United States, not a study of individual participants. We cannot determine the race of each confirmed individual.

We also deleted the question “Which race has the highest average death rate and total cases?” The reason is the same as above: we cannot determine the situation of each individual, so we cannot perform a statistical analysis of this question. We can only observe the correlation coefficients between total cases, deaths, and the proportions of different races in the correlation plot. Therefore, we changed the question to “The proportion of which race is related to the number of confirmed cases and deaths?”

In addition, we added a few more questions: “Are the total cases related to age/gender/poverty?” To answer these questions, we first divided total cases into four levels, and then found that the average values of these variables differ significantly across levels. Thus, we determined that those variables are related to total cases.
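A minimal sketch of this level-based comparison is shown below. It assumes the four levels are cut at the quartiles of total cases; the toy values, the column names `pct_65_over` and `poverty`, and the level labels are hypothetical. The one-way ANOVA at the end is one possible way to check whether group means differ, and is not necessarily the test we used.

```r
# Toy county-level data (all values are made up for illustration).
toy <- data.frame(
  total_cases = c(0, 1, 2, 5, 9, 20, 39, 150, 2000, 110465, 3, 12),
  pct_65_over = c(22, 21, 20, 19, 19, 18, 17, 16, 15, 14, 21, 18),
  poverty     = c(18, 17, 16, 16, 15, 15, 14, 13, 12, 17, 16, 15)
)

# Cut total cases into four levels at the quartiles (assumed break points).
breaks <- quantile(toy$total_cases, probs = c(0, 0.25, 0.5, 0.75, 1))
toy$level <- cut(toy$total_cases, breaks = breaks, include.lowest = TRUE,
                 labels = c("low", "medium", "high", "very high"))

# Average of each variable within each level.
level_means <- aggregate(cbind(pct_65_over, poverty) ~ level,
                         data = toy, FUN = mean)
print(level_means)

# One possible significance check: one-way ANOVA of a variable across levels.
print(summary(aov(pct_65_over ~ level, data = toy)))
```

If many counties share the same case count, two quartiles can coincide and `cut()` will reject the duplicated break points; in that situation the breaks would need to be chosen by hand.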

We also set another question at the beginning of our analysis: “Are there any general trends among the health conditions?” Our analysis showed that the correlation coefficients between health variables (such as sleep status, medical history of various diseases, smoking, obesity, etc.) and deaths are small. Only the correlation coefficient between liver_total_death and deaths is relatively high.

Finally, we deleted the question “Are there any common underlying health conditions?” and changed it to “Does any disease relate to the death rate?”.

2.2 Based on EDA can you begin to sketch out an answer to your question?

2.2.1 United States COVID-19 Cases and Deaths by Provinces (Cities)

2.2.1.1 What are the top 15 Provinces based on the number of cases?

The following bar chart shows the top 15 cities by number of COVID-19 cases.

The bar chart above shows the top 15 provinces by number of cases. New York City has the highest number of COVID-19 cases, with a total of over 100,000, while every other city has fewer than 30,000.
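The ranking behind such a chart can be sketched in base R as follows. This is an illustration only: the four rows reuse the first values shown in the structure listing above, the object names are assumptions, and the original chart may have been produced with a different plotting package.

```r
# First four provinces and case counts from the structure listing above,
# deliberately shuffled so that the sorting step is visible.
toy <- data.frame(
  Province    = c("Suffolk", "New York City", "Westchester", "Nassau"),
  total_cases = c(22691, 110465, 20191, 25250)
)

# Sort by cases in decreasing order and keep at most the first 15 rows.
top_cases <- head(toy[order(-toy$total_cases), ], 15)
print(top_cases)

# Draw the bar chart only in an interactive session.
if (interactive()) {
  barplot(top_cases$total_cases, names.arg = top_cases$Province,
          las = 2, ylab = "Total cases", main = "Top provinces by cases")
}
```

The same pattern applies to deaths or tests by swapping the column used in `order()`.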

2.2.1.2 What are the top 15 Provinces based on the number of deaths?

The following bar chart shows the top 15 cities by number of deaths.

The bar chart above shows the top 15 provinces by number of deaths. New York City has the highest number of deaths, around 8,000, while every other city has roughly 1,000 or fewer.

2.2.1.3 What are the top 15 States based on the number of Tests?

The bar chart above shows the top 15 States by number of tests. New York State performed 499,143 tests, the highest number among all states, while every other state performed fewer than 200,000.

2.2.1.4 What is the average number of cases for each State?

                  State total_cases
1               Alabama       59.03
2                Alaska        9.83
3               Arizona      258.60
4              Arkansas       19.44
5            California      437.43
6              Colorado      122.41
7           Connecticut     1682.00
8              Delaware      638.33
9  District of Columbia     2058.00
10              Florida      323.07
11              Georgia       85.74
12               Hawaii      101.60
13                Idaho       33.32
14             Illinois      227.48
15              Indiana       94.12
16                 Iowa       19.20
17               Kansas       13.84
18             Kentucky       17.32
19            Louisiana      335.34
20                Maine       45.88
21             Maryland      394.75
22        Massachusetts     1843.87
23             Michigan      316.96
24            Minnesota       19.10
25          Mississippi       37.68
26             Missouri       40.98
27              Montana        7.18
28             Nebraska        9.48
29               Nevada      184.35
30        New Hampshire      103.50
31           New Jersey     3196.29
32           New Mexico       40.82
33             New York     3274.52
34       North Carolina       51.20
35         North Dakota        6.45
36                 Ohio       82.81
37             Oklahoma       28.51
 [ reached 'max' / getOption("max.print") -- omitted 14 rows ]
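A state-level average like the one listed above can be computed with `aggregate()`. The sketch below is a base-R illustration under assumptions: the New York rows reuse the first four county values from the structure listing, while "State B" and its values are invented for the example.

```r
# Toy county-level data: four New York counties from the listing above,
# plus a made-up "State B" with two counties.
toy <- data.frame(
  State       = c("New York", "New York", "New York", "New York",
                  "State B", "State B"),
  total_cases = c(110465, 25250, 22691, 20191, 10, 30),
  deaths      = c(7905, 1001, 608, 596, 0, 2)
)

# Mean cases and deaths per county, grouped by state.
state_means <- aggregate(cbind(total_cases, deaths) ~ State,
                         data = toy, FUN = mean)
state_means <- state_means[order(state_means$State), ]

# Long listings are cut off once R's max.print limit is reached; raising it
# first, e.g. options(max.print = 10000), lets every row be printed.
print(state_means)
```

Because the mean is taken over counties, a state with a few very large counties and many small ones can have an average that is far from its typical county.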

2.2.1.5 What is the average number of deaths for each State?

                  State  deaths
1               Alabama   1.701
2                Alaska   0.172
3               Arizona   7.133
4              Arkansas   0.427
5            California  13.328
6              Colorado   5.109
7           Connecticut  83.375
8              Delaware  14.333
9  District of Columbia  67.000
10              Florida   7.836
11              Georgia   3.270
12               Hawaii   1.800
13                Idaho   0.750
14             Illinois   8.510
15              Indiana   4.207
16                 Iowa   0.444
17               Kansas   0.657
18             Kentucky   0.900
19            Louisiana  15.922
20                Maine   1.250
21             Maryland  12.667
22        Massachusetts  49.600
23             Michigan  21.133
24            Minnesota   0.920
25          Mississippi   1.366
26             Missouri   1.284
27              Montana   0.143
28             Nebraska   0.161
29               Nevada   7.059
30        New Hampshire   0.300
31           New Jersey 133.476
32           New Mexico   0.939
33             New York 174.871
34       North Carolina   1.130
35         North Dakota   0.151
36                 Ohio   3.705
37             Oklahoma   1.403
 [ reached 'max' / getOption("max.print") -- omitted 14 rows ]

2.2.1.6 Which cities had the greatest % of population of people with poor health?

2.2.2 Patient Demographics

2.2.2.1 What are the patient demographics?

Table: Statistics summary.

              TC  Population  young   old  black  AIAN  Asian    NH  Hispanic   NHW  Female  Poverty  Social
Min            0          88    0.0   4.8    0.0   0.0    0.0   0.0       0.6   2.7    26.8      3.4     0.0
Q1             2       11034   20.1  16.3    0.7   0.4    0.5   0.0       2.4  64.7    49.4     11.4     8.2
Median         9       25758   22.1  19.0    2.2   0.6    0.7   0.1       4.4  83.5    50.3     14.8    11.1
Mean         191      105871   22.1  19.3    8.8   2.4    1.5   0.1       9.6  76.2    49.9     15.9    11.6
Q3            39       67013   23.8  21.8    9.6   1.3    1.4   0.1       9.9  92.3    51.0     19.0    14.4
Max       110465    10105518   42.0  57.6   85.4  92.5   43.4  48.9      96.4  97.9    56.9     48.6    52.3

Key: TC = total cases; young = % less than 18 years of age; old = % 65 and over; AIAN = % American Indian & Alaska Native; NH = % Native Hawaiian/Other Pacific Islander; NHW = % Non-Hispanic White; Social = Social Association Rate.

From the means in the output, we can see that the average proportion of people under the age of 18 is 22.1%, and the average proportion of people over 65 is 19.3%. The largest racial group is Non-Hispanic White, with an average proportion of 76.2%. The average proportion of women is 49.9%, the average proportion of people living in poverty is 15.9%, and the average Social Association Rate is 11.6. We then divided the data into four levels according to total cases.

2.2.2.2 Which race is the majority of the sample?

Using the average proportions, we drew a pie chart of race, from which we can see the overall proportions of the different races.

2.2.3 Stay at home policy in each province

2.2.4 Underlying Health Conditions

2.2.4.1 Does any disease relate to the death rate?

This shows that liver_total_death has a relatively high correlation with the number of deaths (correlation = 0.4338); this is a moderate rather than a strong relationship.
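Correlations of this kind can be computed with `cor()`. The sketch below is an illustration under assumptions: the first five death counts come from the structure listing above, while the remaining values and the `smokers` column are made up, so the resulting coefficients will not match the 0.4338 reported for the real data.

```r
# Toy data: the first five death counts come from the listing above; all
# other values are illustrative, including the deliberately missing entry.
toy <- data.frame(
  deaths            = c(7905, 1001, 608, 596, 577, 40, 12, 3),
  liver_total_death = c(0.0027, 0.0025, 0.0032, 0.0019, 0.0021,
                        0.0015, NA, 0.0012),
  smokers           = c(12.4, 11.2, 12.6, 11.4, 14.0, 16.5, 18.1, 17.3)
)

# Pearson correlation matrix, using all complete pairs of observations.
cors <- cor(toy, use = "pairwise.complete.obs")

# Correlation of each health variable with deaths, strongest first.
death_cors <- sort(cors["deaths", colnames(cors) != "deaths"],
                   decreasing = TRUE)
print(round(death_cors, 3))
```

Pearson correlation only captures linear association and is sensitive to extreme values such as New York City, so a rank-based alternative (`method = "spearman"`) may be worth comparing.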

2.2.5 Impact of Temperature

2.2.5.1 Does the temperature relate to the Total Cases or Death Rate?

tibble [3,144 x 8] (S3: tbl_df/tbl/data.frame)
 $ Province    : chr [1:3144] "New York City" "Nassau" "Suffolk" "Westchester" ...
 $ State       : chr [1:3144] "New York" "New York" "New York" "New York" ...
 $ days        : num [1:3144] 44 41 38 43 82 35 41 80 39 38 ...
 $ total_cases : num [1:3144] 110465 25250 22691 20191 16323 ...
 $ deaths      : num [1:3144] 7905 1001 608 596 577 ...
 $ temp_peak   : num [1:3144] 8.41 7.41 6.86 5.88 2.25 ...
 $ temp_before : num [1:3144] 8.33 7.78 7.1 6.68 2.55 ...
 $ temp_current: num [1:3144] 9.23 8.36 7.82 7.53 3.34 ...

According to the correlation plot, temperature is only weakly related to total_cases and the number of deaths.

Quantiles of total_cases (minimum, first quartile, median, third quartile, maximum):

[1]      0      2      9     39 110465

3 Models

3.1 How did you select and determine the correct model to answer your question?

Below is how we selected and determined the correct model to answer the question.

3.1.1 Linear model


Call:
lm(formula = deaths ~ Population.Density + GDP + SHP + sleep_hour + 
    poorhealth, data = lineardf3)

Residuals:
   Min     1Q Median     3Q    Max 
-195.5   -6.3   -1.5    3.0  914.0 

Coefficients:
                         Estimate     Std. Error t value             Pr(>|t|)
(Intercept)        -41.1848758118   6.8018600120   -6.05        0.00000000161
Population.Density   0.0022486139   0.0006386507    3.52              0.00044
GDP                  0.0000006479   0.0000000319   20.29 < 0.0000000000000002
SHP                  1.0518973957   0.2137480942    4.92        0.00000091424
sleep_hour           1.5090360066   0.2395795052    6.30        0.00000000035
poorhealth          -1.2298052365   0.2083461901   -5.90        0.00000000404
                      
(Intercept)        ***
Population.Density ***
GDP                ***
SHP                ***
sleep_hour         ***
poorhealth         ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 36.8 on 2584 degrees of freedom
Multiple R-squared:  0.238, Adjusted R-squared:  0.236 
F-statistic:  161 on 5 and 2584 DF,  p-value: <0.0000000000000002
Population.Density                GDP                SHP         sleep_hour 
              1.22               1.24               1.29               1.67 
        poorhealth 
              1.81 

We used the regsubsets function with the exhaustive method to find the best model from two perspectives: BIC and adjusted R-squared. Both criteria point to the same model, which contains five variables: Population Density per Square mile of Land, GDP (2018), % Severe Housing Problems, Sleep <7 Hours_Percent, and % Fair or Poor Health. Judging by the p-values, all five variables are significant. The VIF values (all below 2) show that these five variables do not exhibit a high degree of multicollinearity, so they can all stay in the model. The adjusted R-squared is 0.236, indicating that the model explains 23.6% of the variation in deaths.

Final model: death=-41.185+0.002 Population.Density + 0.0000006479 GDP + 1.051 SHP +1.509 sleep_hour + -1.230 poorhealth
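
The selection step can be illustrated with a minimal sketch. It is not the code behind the output above: it uses the built-in mtcars data and a plain base-R loop over every subset as a stand-in for regsubsets, and the candidate pool (cyl, disp, hp, wt, qsec) is an assumption chosen only to keep the example small.

```r
# Exhaustive best-subset search in base R (stand-in for leaps::regsubsets).
# Data set and candidate predictors are illustrative assumptions.
data(mtcars)
response   <- "mpg"
predictors <- c("cyl", "disp", "hp", "wt", "qsec")

results <- list()
for (k in seq_along(predictors)) {
  # every subset of size k
  for (vars in combn(predictors, k, simplify = FALSE)) {
    fit <- lm(reformulate(vars, response = response), data = mtcars)
    results[[length(results) + 1]] <- data.frame(
      model  = paste(vars, collapse = " + "),
      size   = k,
      bic    = BIC(fit),                      # lower is better
      adj_r2 = summary(fit)$adj.r.squared     # higher is better
    )
  }
}
results <- do.call(rbind, results)

best_bic   <- results[which.min(results$bic), ]
best_adjr2 <- results[which.max(results$adj_r2), ]
print(best_bic)
print(best_adjr2)
```

When both criteria pick the same row, as reported for the county data, the choice of model does not depend on which criterion is preferred.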


Call:
lm(formula = deaths ~ sleep_hour + sleep_hour_high + Low_birthweight + 
    heart_disease + adult_obesity + Food_environment + Respiratory + 
    liver_Total_death, data = disease5)

Residuals:
   Min     1Q Median     3Q    Max 
-133.8  -18.3   -3.9   10.1  862.2 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        -168.5145    33.2787   -5.06          0.000000487 ***
sleep_hour           40.0547     9.8190    4.08          0.000048666 ***
sleep_hour_high     -36.2404     9.7488   -3.72              0.00021 ***
Low_birthweight       3.9123     1.3083    2.99              0.00285 ** 
heart_disease         0.2889     0.0688    4.20          0.000028735 ***
adult_obesity        -1.4105     0.4212   -3.35              0.00084 ***
Food_environment     13.4061     2.3950    5.60          0.000000028 ***
Respiratory          -0.6319     0.1407   -4.49          0.000007913 ***
liver_Total_death 12559.0125  1304.9922    9.62 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56 on 1028 degrees of freedom
Multiple R-squared:  0.279, Adjusted R-squared:  0.273 
F-statistic: 49.7 on 8 and 1028 DF,  p-value: <0.0000000000000002
       sleep_hour   sleep_hour_high   Low_birthweight     heart_disease 
           407.69            409.70              2.07              1.35 
    adult_obesity  Food_environment       Respiratory liver_Total_death 
             1.84              1.95              1.51              1.39 

Call:
lm(formula = deaths ~ sleep_hour + Low_birthweight + heart_disease + 
    adult_obesity + Food_environment + Respiratory + liver_Total_death, 
    data = disease5)

Residuals:
   Min     1Q Median     3Q    Max 
-161.0  -18.2   -4.8    8.7  872.0 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        -181.1828    33.3093   -5.44         0.0000000668 ***
sleep_hour            3.6256     0.6217    5.83         0.0000000073 ***
Low_birthweight       4.4785     1.3075    3.43              0.00064 ***
heart_disease         0.2550     0.0686    3.72              0.00021 ***
adult_obesity        -1.6309     0.4196   -3.89              0.00011 ***
Food_environment     12.8099     2.4045    5.33         0.0000001223 ***
Respiratory          -0.7153     0.1398   -5.12         0.0000003714 ***
liver_Total_death 14469.7867  1206.9498   11.99 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.4 on 1029 degrees of freedom
Multiple R-squared:  0.269, Adjusted R-squared:  0.264 
F-statistic: 54.1 on 7 and 1029 DF,  p-value: <0.0000000000000002
       sleep_hour   Low_birthweight     heart_disease     adult_obesity 
             1.61              2.04              1.33              1.81 
 Food_environment       Respiratory liver_Total_death 
             1.94              1.48              1.17 

We start from all variables, fit a linear model, and select variables based on BIC and adjusted R-squared. The selected model contains eight variables: sleep_hour, sleep_hour_high, Low_birthweight, heart_disease, adult_obesity, Food_environment, Respiratory, and liver_Total_death. In both fits above, every variable has a p-value below 0.05 and is therefore significant. In the first fit, however, the VIFs of sleep_hour (407.69) and sleep_hour_high (409.70) are far too large, which means the two variables are almost perfectly collinear. We therefore remove sleep_hour_high and refit with the remaining seven variables, which gives the second fit above, where all VIFs are below 2.1.
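
To show why a pair of near-duplicate predictors produces VIFs in the hundreds, here is a small sketch on simulated data. The variable names x1, x2, x3 and the helper vif_by_hand are hypothetical; the helper applies the textbook definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors, and is not necessarily the function used to produce the output above.

```r
# VIF from its definition, on simulated predictors (illustrative only).
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # nearly a copy of x1
x3 <- rnorm(n)                  # unrelated predictor
d  <- data.frame(x1, x2, x3)

vif_by_hand <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    # R-squared of predictor v regressed on all other predictors
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_all     <- vif_by_hand(d)                     # x1 and x2 are inflated
vif_reduced <- vif_by_hand(d[, c("x1", "x3")])    # after dropping x2
print(round(vif_all, 1))
print(round(vif_reduced, 2))
```

Dropping one member of the collinear pair brings the remaining VIFs back close to 1, which mirrors the effect of removing sleep_hour_high.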

                      Abbreviation
sleep_hour                     sl_
sleep_hour_low            slp_hr_l
sleep_hour_high           slp_hr_h
diabetes                        db
diabetes_male               dbts_m
diabetes_female             dbts_f
heart_disease                    h
Hypertension                     H
obesity_age                      o
Poor_health                    Pr_
Physical_Unhealthy             P_U
Mentally_unhealthy              M_
Low_birthweight                  L
smokers                         sm
adult_obesity                    a
Food_environment               Fd_
Physically_inactive            Ph_
Exercise_opportunity             E
excessive_drink                  e
Uninsured                        U
Preventable_Hospital           P_H
Mammogram                       Mm
Flu_vaccinated                 Fl_
Respiratory                      R
liver_crude_mortality          l__
liver_Total_death              l_T

In the Mallows' Cp plot, the left subset size is 1 and the right subset size is 6. Cp reaches its lowest point at around seven variables, and those seven variables are exactly the ones selected above.
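
For reference, the usual textbook definition of Mallows' Cp is given below. The symbols are assumptions introduced here: \(\mathrm{RSS}_p\) is the residual sum of squares of a candidate model with \(p\) estimated coefficients (including the intercept), \(\hat{\sigma}^2\) is the residual variance of the full model, and \(n\) is the number of observations.

\[
C_p = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - n + 2p
\]

A candidate with little bias has \(C_p\) close to \(p\), so a small Cp marks the subset size beyond which extra variables add little.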

3.1.2 LASSO Regression

Because there are many candidate variables, we use LASSO regression to fit the model. LASSO can shrink the coefficients of many variables to exactly 0, so it also performs variable selection.
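
A common way to write the LASSO objective is shown below. The \(1/(2n)\) scaling is an assumption (it is the convention used by some R packages, and the source does not state which one was used); \(y_i\) is the response, \(x_i\) the vector of predictors for observation \(i\), and \(\lambda \ge 0\) the penalty weight chosen by cross-validation.

\[
\hat{\beta} = \arg\min_{\beta_0,\,\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\]

The absolute-value penalty is what allows coefficients to become exactly zero, unlike a squared penalty, which only shrinks them.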

lowest lamda from CV:  0.00246 

The cross-validated MSE is lowest at \(\lambda \approx 0.002\).

Mean MSE for best Lasso lamda:  0.203 

All the coefficients : 
       (Intercept)         population              young                old 
          -0.00301            0.26499            0.03709            0.02258 
             black               AIAN              Asian                 NH 
           0.00000           -0.00339           -0.09691           -0.00689 
          Hispanic                NHW             Female              Rural 
          -0.00461            0.00751           -0.01883            0.02263 
Population.Density 
           0.11744 

The non-zero coefficients : 
       (Intercept)         population              young                old 
          -0.00301            0.26499            0.03709            0.02258 
              AIAN              Asian                 NH           Hispanic 
          -0.00339           -0.09691           -0.00689           -0.00461 
               NHW             Female              Rural Population.Density 
           0.00751           -0.01883            0.02263            0.11744 

From the LASSO regression, the coefficients of 11 variables are non-zero; only the coefficient of black is shrunk to zero. The results suggest that race, gender, age, population, population density, and the rural proportion are all associated with total cases.

We then calculate the R-squared of the LASSO regression, which is 0.164.
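
The mechanism by which LASSO zeroes out a coefficient, and the R-squared calculation, can be sketched in base R. This is an illustration on simulated data, not the fit reported above: lasso_cd and soft_threshold are hypothetical helpers implementing plain coordinate descent for a squared-error loss scaled by 1/(2n), and the fixed lambda of 0.3 is an arbitrary assumption rather than a cross-validated value.

```r
# Soft-thresholding: shrinks z toward 0 and sets small values to exactly 0.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Coordinate-descent LASSO for a centred response and standardised predictors.
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      # residual with predictor j left out of the current fit
      partial_resid <- y - X[, -j, drop = FALSE] %*% beta[-j]
      z <- sum(X[, j] * partial_resid) / n
      beta[j] <- soft_threshold(z, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}

# Simulated data: only the first two predictors matter.
set.seed(1)
n <- 500
X <- scale(matrix(rnorm(n * 4), n, 4))
y <- 2 * X[, 1] - 1 * X[, 2] + rnorm(n)
y <- y - mean(y)

beta_hat <- lasso_cd(X, y, lambda = 0.3)
print(round(beta_hat, 3))

# R-squared of the fitted model: 1 - SSE / SST.
pred      <- as.vector(X %*% beta_hat)
r_squared <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2)
print(round(r_squared, 3))
```

The two irrelevant predictors end up with coefficients of exactly zero, in the same way that the coefficient of black is zero in the output above.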

3.2 What prediction can you make with your model?

For the temperature part:

tibble [3,144 x 8] (S3: tbl_df/tbl/data.frame)
 $ Province    : chr [1:3144] "New York City" "Nassau" "Suffolk" "Westchester" ...
 $ State       : chr [1:3144] "New York" "New York" "New York" "New York" ...
 $ days        : num [1:3144] 44 41 38 43 82 35 41 80 39 38 ...
 $ total_cases : num [1:3144] 110465 25250 22691 20191 16323 ...
 $ deaths      : num [1:3144] 7905 1001 608 596 577 ...
 $ temp_peak   : num [1:3144] 8.41 7.41 6.86 5.88 2.25 ...
 $ temp_before : num [1:3144] 8.33 7.78 7.1 6.68 2.55 ...
 $ temp_current: num [1:3144] 9.23 8.36 7.82 7.53 3.34 ...

According to the correlation diagram, temperature is only weakly related to total_cases and deaths. However, it is related to the variable days: the higher the temperature in a city, the later the disease appears there. For prediction, this suggests that a second peak would come earlier as the weather turns from summer to winter and temperatures fall.
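
A correlation matrix of this kind can be computed with cor(). The sketch below uses only the first five rows printed in the structure output above, so it illustrates the call and not the reported result; the name temp_df is hypothetical, and the full 3,144-row table is needed to reproduce the actual diagram.

```r
# First five rows as printed in the str() output above (illustration only).
temp_df <- data.frame(
  days         = c(44, 41, 38, 43, 82),
  total_cases  = c(110465, 25250, 22691, 20191, 16323),
  deaths       = c(7905, 1001, 608, 596, 577),
  temp_peak    = c(8.41, 7.41, 6.86, 5.88, 2.25),
  temp_before  = c(8.33, 7.78, 7.10, 6.68, 2.55),
  temp_current = c(9.23, 8.36, 7.82, 7.53, 3.34)
)

# Pairwise Pearson correlations between all numeric columns.
cor_mat <- cor(temp_df)
print(round(cor_mat, 2))
```

With so few rows a single county dominates every coefficient, which is why the conclusions in the text rest on the complete data.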

3.3 How reliable are your results?

# A tibble: 6 x 9
  total_cases deaths sleep_hour heart_disease Low_birthweight adult_obesity
        <dbl>  <dbl>      <dbl>         <dbl>           <dbl>         <dbl>
1       25250   1001       38.0         142.             7.89          23.6
2       22691    608       35.6         120.             7.74          24.6
3       20191    596       33.1          97.6            7.95          20.7
4       16323    577       33.4          95.2            8.96          28  
5       12209    820       41.5         157.            10.7           34.7
6       10426    550       38.4          78.5            7.83          22.3
# ... with 3 more variables: Food_environment <dbl>, Respiratory <dbl>,
#   liver_Total_death <dbl>
Data:   X dimension: 1037 8 
    Y dimension: 1037 1
Fit method: svdpc
Number of components considered: 8

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           65.77    63.38    43.69    40.53    37.79    37.95    26.35
adjCV        65.77    63.40    43.59    40.37    37.71    38.61    26.13
       7 comps  8 comps
CV       24.24    23.16
adjCV    24.14    23.09

TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X        38.227    57.98    70.52    78.37    85.00    91.23    96.28   100.00
deaths    7.447    57.73    65.81    70.13    70.94    84.80    87.19    88.69
      total_cases        sleep_hour     heart_disease   Low_birthweight 
             1.84             -3.86             -3.19             -4.07 
    adult_obesity  Food_environment       Respiratory liver_Total_death 
            -4.64              4.34             -3.95              2.05 
      total_cases        sleep_hour     heart_disease   Low_birthweight 
          29.3934           10.0243            9.5902            0.0375 
    adult_obesity  Food_environment       Respiratory liver_Total_death 
          -5.4036            3.8446           -6.3601           26.2150 
   1    2    3    4    5    6    7    8 
63.4 62.0 66.4 66.0 16.3 52.7 89.4 49.0 
  1   2   3   4   5 
529 477 403 551 370 
  deaths     PC1   PC2   PC3    PC4    PC5   PC6    PC7    PC8
1  14.97 -4.5542  9.77 -5.52  4.688 -2.830 6.675 -2.071 -1.739
2   8.99 -4.4172  8.83 -4.70  3.523 -1.910 6.130 -2.319 -1.719
3   8.81 -4.8449  7.45 -3.25  3.167 -2.269 5.943 -1.904 -1.562
4   8.52 -4.8026 11.20 -3.72 -1.834  1.753 0.971 -3.210 -0.165
5  12.22  0.0448  7.93 -3.19  0.242 -0.374 1.186 -1.297 -0.641
6   8.11 -3.5058  4.57 -1.23  2.697 -0.725 2.278  0.186 -0.490

For the linear model relating deaths to the disease variables, we added total cases as a predictor because deaths and total cases are strongly correlated. Using principal component regression, we plot the cross-validation curve and track the prediction error (RMSEP) for deaths as components are added. The RMSEP drops sharply once the number of components reaches two (from 63.38 to 43.69), which means that PC1 and PC2 together already explain a large share of the variation in deaths (57.73%). This suggests that the model is reasonably reliable.
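
The cross-validated RMSEP curve can be reproduced in outline with base R. The sketch below is a stand-in under stated assumptions, not the original analysis: the data are simulated, pcr_cv_rmsep is a hypothetical helper, and principal component regression is built from prcomp plus a least-squares fit on the leading scores rather than from a dedicated package.

```r
# Simulated predictors, two of them correlated (illustrative only).
set.seed(42)
n <- 300
X <- matrix(rnorm(n * 5), n, 5)
colnames(X) <- paste0("x", 1:5)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.3)
y <- 3 * X[, 1] + 2 * X[, 3] + rnorm(n)

# k-fold cross-validated RMSEP for PCR with `ncomp` components.
pcr_cv_rmsep <- function(X, y, ncomp, k = 10) {
  folds  <- sample(rep(seq_len(k), length.out = nrow(X)))
  sq_err <- numeric(nrow(X))
  for (f in seq_len(k)) {
    test <- folds == f
    # PCA on the training fold only, so the test fold stays unseen
    pca <- prcomp(X[!test, , drop = FALSE], center = TRUE, scale. = TRUE)
    train_scores <- pca$x[, seq_len(ncomp), drop = FALSE]
    fit <- lm.fit(cbind(1, train_scores), y[!test])
    # project the test fold with the training centre, scale and rotation
    test_scaled <- scale(X[test, , drop = FALSE], pca$center, pca$scale)
    test_scores <- test_scaled %*% pca$rotation[, seq_len(ncomp), drop = FALSE]
    pred <- cbind(1, test_scores) %*% fit$coefficients
    sq_err[test] <- (y[test] - pred)^2
  }
  sqrt(mean(sq_err))
}

rmsep <- sapply(1:5, function(m) pcr_cv_rmsep(X, y, m))
print(round(rmsep, 2))
```

Reading such a curve means looking for the number of components after which the error stops falling quickly; in the output above the largest early drop is between one and two components, with further drops at six and more.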

3.4 What additional information or analysis might improve your model results or work to control limitations?

The random forest method is easy to understand and interpret and is insensitive to outliers, so there is no need to remove outliers. It also handles missing data well, is computationally inexpensive, and requires little data pre-processing, which makes it well suited to large data sets with many rows and missing values. However, as the trees become larger, they become harder to interpret. The variance of an individual tree is high and its predictive performance is low, so the model can easily overfit.

In addition, the data are not good enough to predict the death rate because they are outdated. The data set was last updated in April, and the number of deaths has increased greatly since then. There is also a lot of missing data, especially for the disease variables: only around 20 states out of 51 have a full record of disease history. As a result, the data available for analysis drop from about 3,000 rows to about 1,000 rows after rows with missing values are deleted. The COVID-19 death counts are also incomplete and may be inaccurate, since some patients who died of COVID-19 may have been counted as deaths from seasonal flu. Finally, some potentially important variables, such as blood type, were not included in the data set or the analysis.

4 Conclusion

After conducting our EDA and fitting our models, we came to several conclusions about which factors could affect the number of cases and the death rate due to COVID-19. First, we found a significant relationship with population density: the higher the population density, the higher the number of cases. This may be because more people go out in crowded areas or take public transportation, which could lead to more exposure to the disease. Ethnicity also had an effect on the number of cases and death rates. Looking more closely at this variable, we saw that the three races with the largest impact were Black, Asian, and Hispanic, which are all minority groups. Minorities tend to face predisposing conditions, such as social determinants of health and lack of access to care, that may lead to more exposure to the disease and thus to more cases and deaths. Another factor was the rural proportion, which was related to confirmed cases. Patients in rural areas may have higher rates of chronic diseases that were not covered in this data set and that could be associated with getting COVID-19. People in rural areas also face social inequities and social determinants of health that could lead to increased exposure to the disease. Age was also shown to have an impact on the number of cases. The younger generation may go out more, may not follow enforcement or mandate policies, and/or may believe they are invulnerable to the disease, all of which lead to increased exposure. A couple of other variables shown to have an impact were gender and liver_Total_death.

In the linear model, the coefficients of % Severe Housing Problems and Sleep <7 Hours_Percent are positive, indicating that regions with more severe housing problems and a larger share of residents sleeping less than seven hours tend to have more deaths. Sleep time may affect health conditions, and good housing conditions can help people minimize exposure to the virus.

5 Bibliography

Assessing Risk Factors for Severe COVID-19 Illness. (n.d.). Retrieved August 09, 2020, from https://www.cdc.gov/coronavirus/2019-ncov/covid-data/investigations-discovery/assessing-risk-factors.html

Li, A. Y., et al. (2020). Multivariate Analysis of Factors Affecting COVID-19 Case and Death Rate in U.S. Counties: The Significant Effects of Black Race and Temperature. medRxiv. Cold Spring Harbor Laboratory Press. https://www.medrxiv.org/content/10.1101/2020.04.17.20069708v2